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Abstract 

We present a novel Bayesian nonparametric regression model for covariates X and continuous 
response variable V £ R. The model is parametrized in terms of marginal distributions for Y and X 
and a regression function which tunes the stochastic ordering of the conditional distributions F{y\x). 
By adopting an approximate composite likelihood approach, we show that the resulting posterior 
inference can be decoupled for the separate components of the model. This procedure can scale to 
very large datasets and allows for the use of standard, existing, software from Bayesian nonparametric 
density estimation and Plackett-Luce ranking estimation to be applied. As an illustration, we show 
an application of our approach to a US Census dataset, with over 1,300,000 data points and more 
than 100 covariates. 


1 Introduction 

Bayesian nonparametric regression offers a flexible and robust way of modeling the dependence between 
covariates x € X and a response variable V £ M by using models with larger support than their parametric 
counterparts. Nonparametric statistical models are motivated by robustness and their ability to capture 
effects such as outliers, strong nonlinearities or multimodalities, while providing probabilistic measures 
of predictive uncertainty. Bayesian nonparametric regression methods are largely underpinned by one 
of two random probability measures namely, Dirichlet process mixtures (Ferguson, 1973; Lo, 1984) and 
Polya trees (Lavine, 1992, 1994). These approaches, widely applied to density estimation problems (see 
e.g. Hjort et al., 2010), have been used as building blocks of various nonparametric regression models 
through a number of different approaches. 

One approach, called the conditional approach, considers the covariates as fixed, and models directly 
the conditional distribution f{y\x) of the response given the covariate . This conditional distribution 
may be constructed in a semiparametric or fully nonparametric way. The semiparametric conditional 
approach typically assumes that 


Y = ri{x) + € (1) 

where rj is some unknown flexible mean function and e is the residual. Regression models (priors) have 
been proposed for the mean function y such as Gaussian processes (see e.g. Rasmussen, 2006), basis 
function representations such as splines or kernels (Denison et ai, 2002; Muller & Quintana, 2004) or 
Bayesian regression trees (Chipman et al., 2010). More generally, Kottas & Gelfand (2001) and Lavine & 
Mockus (1995) proposed to use Dirichlet process mixtures for the distribution of the residuals, while Pati 
& Dunson (2014) jointly model the mean function and residual distribution using Gaussian processes and 
probit stick-breaking processes (Ghung & Dunson, 2009). The fully nonparametric conditional approach 
considers that f{y\x) = Jq f{'y\x,9)Px{d9) takes the form of a mixture model with unknown mixing 
distribution for 9. A prior is set on the family of probability distributions {Px)xex- In particular, 
following the seminal work of MacEachern (1999), various dependent Dirichlet process models have been 
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proposed in the literature (Gelfand & Kottas, 2003; Griffin & Steel, 2006; Dunson et al., 2007; Caron 
et al, 2007, 2008; Dunson & Park, 2008). Similarly, Trippa et al (2011) define a class of dependent 
random probability distributions using Polya trees. 

An alternative to the conditional approach is to treat the covariates as random variables and to 
build a joint statistical model for (X, Y). In this way, one can cast the regression problem as a density 
estimation one. For example, Miiller et al. (1996) proposed to use Dirichlet process mixtures for the joint 
distribution of {X,Y). This approach was later extended by Shahbaba & Neal (2009), Hannah et al 
(2011) and Wade et al (2014). 

A major drawback of current Bayesian methods for semi or nonparametric regression is that many 
methods do not scale well with the number of samples and/or with the dimensionality of the covariates. 
In this paper, we propose a novel joint Bayesian nonparametric regression model Fx,y that affords an 
approximation which can scale easily to large data applications. The model is parameterized in terms 
of the marginal distributions of the response Fy and covariates Fx, and then a conditional regression 
model that utilises the two marginal distributions. 


Fx 

^Fx 

(2) 

Fy 

^¥y 

(3) 

p 

- TTfi 

(4) 

Fx,y{x,y) 

= Cxp{Fx{x),Fy{y)) 

(5) 


where Px and Pv are some nonparametric prior over probability distributions, A/j : A —)■ IR_|_ is some 
parametric regression function of the covariates, and Cxi^ plays a role similar to a copula in that it takes 
marginal distributions as inputs and characterises the dependence between them using the function A^. 
In particular we consider a Plackett-Luce model for ranks for the regression structure. This construction, 
detailed in Section 3, builds on the original Plackett-Luce model (Luce, 1959; Plackett, 1975) for ranking. 
The positive function Xp tunes the stochastic ordering of the responses given the covariates, the ratio 
Xp{Xi)/ {\p{Xi) + \p{Xj)) representing the conditional probability, Pr{Yi < Yj\Xi,Xj), that response 
Yi is less than response Yj given knowledge of {Xi,Xj}. There is thus a natural interpretation of the 
parameters: X/s tunes the relative ordering of the responses at different covariate values, and Fy sets the 
marginal distribution of the responses. This strong interpretability is an important feature as it provides 
a good vehicle for specifying prior beliefs. 

For inference we propose to use a marginal composite likelihood approach, which we show allows the 
model to scale tractably to large data applications and allows for the use of standard, existing, software 
from Bayesian nonparametric density estimation and Plackett-Luce ranking estimation to be applied. 
As an illustration, we show an application of our approach to a US Gensus dataset, with over 1,300,000 
data points and more than 100 covariates. 

The paper is organized as follows. Section 2 provides background on Dirichlet process mixtures and 
Polya trees for density estimation. Section 3 describes the Plackett-Luce copula model. The marginal 
composite likelihood approach for scalable inference is presented in Section 4. Section 5 presents some 
results of our approach on simulated data and on the US Census dataset. 

2 Bayesian nonparametric density estimation 

The appeal of Bayesian nonparametric models is the large support and probabilistic inference provided by 
such priors. This both safeguards against model misspecification and enables highly flexible estimation 
of distributions. This has lead to particular popularity of Bayesian nonparametric priors in density 
estimation. 

In the simple case of density estimation for a real valued random variable many nonparametric priors 
exist - see Hjort et al. (2010) for a recent review. A popular class of model is the Dirichlet Process 
Mixture (Lo (1984)), whereby a Dirichlet process prior is placed on the distribution of the parameters 
of a parametric family. The result is an “infinite mixture model”. Precisely: 

fy{y) = I K{y\e)dPi0) 

P~DP(c,Po) 
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where K is the density of the chosen parametric family, c > 0 is a scale parameter and Pq is a base 
measure. Since draws from a Dirichlet Process are almost surely atomic measures, there is positive 
probability of observations sharing a parameter value given the random measure P. The result is an 
effect of clustering within a sample, with a random, limitless number of clusters. This has proved to 
be an extremely popular model as it models heterogeneity within a sample well, and provides a highly 
flexible support. Efficient MCMC schemes (Escobar & West (1995); MacEachern & Muller (1998); Neal 
(2000)) have lead to the widespread use of the Dirichlet Process Mixture (DPM) in density estimation. 

Polya trees provide another flexible nonparametric prior for density estimation (Ferguson (1974); 
Lavine (1992, 1994); Mauldin et al. (1992)). They are defined as follows: Let e = (ei,...,efe) e = 

{0,1}*, and define a sequence of embedded partitions of M to be T^, = (Bg : e G E*}, where the 
are defined recursively, such that i?eo U Bei = B^. Now let E* = Ufc>iE*, the set of all countable 
sequences of zeros and ones, and let A = {a^ '■ e € E*} be a set of nonnegative real numbers. Then, 
a random probability measure P is a Polya tree process with respect to P = {Pfc : A: > 1} and A if 
P{Beo I Be) ~ Beta(ae 0 5 Q^ei)) independently for all e € E*. There are two properties of the Polya tree 
process that are appealing for density estimation: Polya trees are conjugate, meaning that both the prior 
and the posterior have the same functional form, and, for certain choices of A, realizations are absolutely 
continuous probability distributions, almost surely. It is worth pointing out that empirically the model 
can depend heavily on the defined sequence of partitions E, although a mixture of Polya trees proposed 
by Lavine (1992) can smooth out this dependence over multiple partitions. In what follows we make use 
of these nonparametric models to specify priors for the marginal distributions of covariates and response 
variables. 


3 The statistical model 

Let {Xi, Yi), i = 1,..., n be the covariates and responses and regression function : T —>■ IR+. To build 
the dependence we introduce a latent random variable Zi that is used to capture the underlying relative 
level of the response via, 


Zi\X, = Xj ~ Exp(A/3(xj)) (6) 

where Exp(a) denotes the standard exponential distribution of rate a. The latent variable Zi may be 
interpreted as an “arrival time” of individual i. The arrival times then define a conditional ranking of 
the predicted response variables 1},..., E„. 

The model can be summarized as follows, for i = 1,... ,n 


Af. ~ Fx 

( 7 ) 

Z, 1 Exp(A;3(W)) 

( 8 ) 

Y, = Fy\Fz{Zi)) 

( 9 ) 


where 


Fz{z) 


lx 


Fz\x=x{z)dFx{x) 



X _ g-^/3(a;)z 


dFx{x). 


Figure (1) shows the correspondence between the conditional exponential random variables, Z\X, 
shown in 1(a) for differing covariate values, and the resulting predictive distributions in 1(b), where 
the marginal Fy is a Gaussian mixture model shown as the black line. We can see visually that the 
distributions in 1(b) are stochastically ordered under the model. The coloured points shown in (a) are 
mapped to the points shown in (b), where again ordering is preserved. 

As Fy and Fz are cumulative density functions, Fy^ o Fz is a monotonically increasing function and 


P(b} <Yj)= F{Z, < Zj) 


A/3(Xi) 

^fni^Xi) ^(nixj) 


This clarifies the role of the regression function. More generally, given an ordering v = (^i,..., vA) (a 
permutation of {1, 2,..., n}), we have 
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(a) 


(b) 


Figure 1: Illustration of the latent variable used to capture the regression dependence. In (a) we show the 
distribution of the conditional latent variable, Z, at various points in X assuming a log-linear dependence. In (b) 
we see the corresponding predictive distributions using a Gaussian mixture model for the marginal, FV, shown 
as the black line. The points in Z shown in (a) are mapped to the points in Y shown in (b) where the ordering 
is preserved. 


The above model is the Plackett-Luce model (Luce, 1959; Plackett, 1975), popular in the ranking 
literature, and also corresponds to the partial likelihood used for Cox proportional hazards models (Cox, 
1972). 

By construction Fz{Zi) is marginally uniformly distributed on [0,1]. Thus, Yi = Fy\Fz{Zi)) is 
marginally distributed from Fy. The joint distribution Fxy can thus be described in terms of marginals 
Fx and Fy and a Plackett-Luce copula C\^ such that 

Fx,Yix,y) = CxpiFxix),Fy{y)). 

The Plackett-Luce copula takes the following form 

nUi 

Cx^{ui,U2) = Ui - exp{-Xp{u})Fz^{u2)) duj. ( 10 ) 

J oj —0 

Figure 2 shows illustration of the copula for different functions A^. 

The conditional distribution function can then be expressed as 

Fy\x=xiy) = 1 - exp{-Xfiix)F-\Fy{y))). 

Given Xp, the random variables Y\X = x are stochastically ordered. For xi,X 2 such that Xp{xi) < 
Xi3{x2) 


Fy\x=xAy) < Fyix=x2{y) ^y g P- 


If Fy has a density with respect to Lebesgue measure, fy, then we can use a change of variables to 
calculate the conditional density as follows: 


r I \ t t xfz\X = x{z{y)) 

fY\x=x{y) = /y(y “r7~rvr“ 
fz{z(.y)) 


= fxiy) 


= fxiy) 


fz\x=xiFz\Fyiy))) 

fz{F^\Fy{y))) 

Xpix)exp[-Xp{x)F-^{Fy{y))] 

Xfiix') exp[-Xi3{x')Fz^{Fy{y))]dFx{x') 
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(a) \p(x) = 1 


(b) Xpix) = .01 if X < 0.5, 1 (c) \p{x) = exp(—100(x — (d) \p(x) = exp(—lOOx) 

otherwise 



(e) 


(f) 


(g) 


(h) 


Figure 2: Examples of the Plackett-Luce copula for different functions \p. The top figures (a-d) plot the different 
functions \p. The bottom figures (e-h) represent samples from the copula C\^(ux,Xy), where X £ [0,1] and Fx 
is uniform. 


It can be seen from this representation that the conditional density of Y^, given Xi is simply the 
marginal density of Yi, re-weighted across its quantiles (Fy(j/)) by a function of X^. 

We end the construction of the model by assuming a prior over the finite-dimensional parameter f3 
and Bayesian nonparametric prior over the marginal distributions Fx and Fy 




(11) 

Fy n 

'X Py 

(12) 

Fx '' 

Px 

(13) 


where is some parametric prior and Px and Pv may be a Dirichlet process mixture or a Polya tree 
prior, as described in Section 2. 


4 Approximations for posterior inference and prediction 


Assume that both Fx and Fy admit a density with respect to Lebesgue measure, noted fx and fy. The 
unknown quantities for our regression model are therefore (fv, P, fx)- Given data {xi-,n,yi-.n), where 
Xi:n = (a^i, ■ • ■, Xn) and yi:n = (yi,..., 2 /„), we have the following likelihood: 


L{fY,P,fx] (.Xi..n,yi-n)) =WfY{yi) 


Z=1 


Xpjxi) exp[-Aff^(fy(yi))] 
^^\p{x')eiq^[-\p{x')Fz^{Fy{yi))\dFx{x' 


fx{xi). 


(14) 


Inference could proceed using numerical methods such as MCMC but for large datasets this is cumber¬ 
some. Hence we consider here a Bayesian composite marginal likelihood approach (Lindsay, 1988; Cox & 
Reid, 2004; Varin et ai, 2011; Pauli et ai, 2011; Ribatet et al., 2012) that we show offers computational 
tractability and the use of standard Bayesian methods. Define ?/*.„ to be j/i:„ ordered from lowest to 
highest, and let = (i^i,..., Vn) be a vector representing the order of yi:„, so that y* = y^.. Then 
we can re-write our data {yi-.n,xi-.n} equivalently as Now let Lc denote the compos¬ 
ite marginal likelihood based on and {vi:n,xi-.n}- That is the product of the likelihood terms 

associated with each of these terms: 


Lc{fY,P,fx;{yi:n,Xl-.n}) = L{fy, ^, fx] {yl.n}) X Hfx , P, fx] Xy.n}) 


= n! 


Y[.fY{y^) 




n 


Xy {xi>^ 




.i=i 


Y[fx{x^) 




(15) 
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We can see that this composite likelihood approach factors the likelihood into separate terms involving 
/v,/? and fx, leading to the following pseudo posterior distribution 


T^c{fY,l3,fx I {yi:n,Xl:n}) = T^c{fY\yt.n)T^c{fx\xi:n)T^c{l3Wl:n,Xl-.n) (16) 

Inference over the parameters fY,P,fx can thus be carried out independently under the composite 
likelihood approach. Standard software for Bayesian nonparametric univariate density estimation can be 
used for fy and fx, and software for fitting Plackett-Luce/Cox proportional hazard can be used for fitting 
p. Overall the advantages of the approximate composite likelihood approach include computational 
tractability and scalable inference using standard software, hence good numerical reproducibility, and 
high interpretability as the components in the composite likelihood have explicit form and meaning. This 
latter point aids in prior elicitation as it allows the analyst to separate out and represent their beliefs 
on the marginal distributions, which are simpler to specify than the full conditionals, and then consider 
the dependence given the marginals. 

The Bayesian composite likelihood approach has attracted some attention over recent years (Pauli 
et al., 2011; Varin et ai, 2011; Ribatet et at, 2012). In particular, Ribatet et al. (2012) considered two 
adjustements to the marginal likelihood approach in order to retain some of the desirable properties of 
the usual likelihood. However, their adjustments apply to a specific form of composite likelihood, where 
it factorizes as a product of composite likelihoods for each observation: L*°*“*(?/|0) == nr=i ^c{yi\S) where 
Lc{yi\0) is the composite likelihood for observation i. Our composite likelihood approach does not fit 
in this framework, as we do not have this product form over the observations, and we cannot therefore 
apply the adjustments suggested by Ribatet et al. (2012). Extending the adjustment of Ribatet et al. 
(2012) to our framework is an interesting direction, but beyond the scope of this article. 

4.1 Asymptotics for the marginal composite posteriors 

Consider first the pseudo-posterior for fy: 


T^cifY I {yi:n,Xl,n}) OC Tr{fy)Lc{fY', {yi:n, Xl,n}) 

n 

« 7r(/y) J|/y(y*). 
i=l 

So our pseudo-posterior is exactly the posterior based on the i.i.d sample {yi-.n}, where yyn ~ Fy. 
This is the standard setting for posterior inference, so we can apply consistency results from Bayesian 
nonparametric inference for Fy, see for example Ghosal & Van der Vaart (2013). The same is true for 
fx. Now consider the log-linear form for A: A(a;) = exp(—/3a;). Then, we have the pseudo-posterior: 


I {yi-.n, Xi-.n]) OC 7 r(/ 3 )Lc(/ 3 ; {//!:„, Xyn}) 


r(/3)n 


p/3a:,. 


J.J. y- 


j>i 


This is exactly the posterior considered by Kim (2006) in a different setting where a Bernstein-Von Mises 
theorem is proven, which can be applied here. 


4.2 Posterior predictive 

We can use simulation methods such as MCMC to easily generate samples {Fy \ from the 

pseudo-posterior (16); the predictive distribution can then be approximated by 

m 

P{y' I x' ,{yi..n,Xi..n}) Ci — ^p{y' \ x', F^^'’) . 

3=1 

To simulate from this distribution, we can use the forward generating process of our model, given 
V = x': 
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Z'|X' = a;'~Exp (A;3 (x')) (17) 

Y'= Fy\Fz{Z')). (18) 

In many applications, modeling Fx might be cumbersome, and not the primary object of interest. In 
this case we propose to use an empirical Bayes approach by setting Fx = Fx at the empirical CDF. So, 
to generate a posterior predictive sample, given a posterior sample {Fy \ Eq. (18) becomes: 

Y'U) = ( 1 - - ^ 

\ ” i=i 

where we note that is conditional on X' = x', and the CDF inversion is tractable, depending on the 
form of Fy. Alternately one can use Monte Carlo to draw samples from the predictive, which is trivial 
when Fy can be sampled from. Some particular examples are discussed in Appendix A. 

5 Illustrations 

In this Section we apply our method to two examples. The first is a simulation example where we 
generate from a multi-modal conditional and explore the ability of our method to fit the data. The 
second is a large real-world application in the regression analysis of US Census data. 

5.1 Simulation example 

In this section we apply the model to a dataset simulated from our model to consider how well we 
can recover known dependence. The marginal distribution of Y is set to a mixture of three Gaussian 
distributions, with means 3, 9 and 15, standard deviations of 2, 0.5 and 1 with mixture weights of 0.5, 
0.2 and 0.3 respectively. /3 is set to 0.25, with \p{x) = exp(/3x). X ^ Unif(0, 20) and n = 500. The data 
is shown in Figure 3(a). 

Clearly any type of linear or non-linear regression with a parametric noise distribution will be in¬ 
appropriate here. The conditional distribution of Y given x is multi-modal, rendering many popular 
regression models inappropriate. 


- I- 


■•••■« • , ••• 

• I \ . 

* • • •• • . 

' • ' - • 

• • •• V 

• . ■» ' .vr Vvift*-. 


.• ■ N 



(a) Data 


(b) 80 % highest posterior predictive intervals 


Figure 3: (a) Data simulated from the model with mixture of three Gaussians marginal distribution for Y. (b) 
80% highest posterior density intervals of the predictive distribution at each value of x. 


We compared our approach to a linear dependent Dirichlet process mixture of normals (LDDPM) 
(De lorio et al., 2004), using the R package DPpackage (Jara et al, 2011; Jara, 2007). This model 
specifies that 
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J Af{yi;XiP,a‘^)G{dl3,da‘^) 

G I a, fj.b, Sb ~ DP(q;Go) 

where Gq = JV{nb, Sb) Gamma(Ti/2, T 2 / 2 ) and 

s& I '0 ~ IW{v,^p) 

with a = l,fj,b = i 9 , 0 y,iy = 4, n = 1, r2 = 2, V' = (J 1) and Sb = 36 )■ 

We apply our model, modeling the marginal as a Dirichlet Process mixture of Gaussian distributions 
using a = 1 and a normal-inverted-Wishart distribution for the base measure. That is, our base measure 
Go{fi,a^) I fii, ^)IW{a^ \ where fii =9 ,ki = 0.5, = 4 and '0i = 1. A Gaussian prior 

centered at 0 with unit variance is used for j3. 




(a) (b) 

Figure 4: (a) The posterior predictive marginal for y under our model in blue, compared to the actual sampling 
distribution in black, (b) The posterior distribution for /3, compared to the true value of 0.25 marked in red. 

In Figure 3(b) the simulated data is shown, with the 80% highest posterior density (HPD) intervals 
of the predictive distribution at each value of x. Qualitatively we see that the model can capture the 
nonlinearities in the data and demonstrates the flexibility to model the multi-modal conditional response. 
In Figure 4 we show the predictive marginal, Fy and the posterior distribution for /3. Glearly the marginal 
distribution for Y is very well recovered from the data. This parameterization of the model in terms of 
the marginal distribution for the response allows this to be estimated from the complete dataset, without 
reliance on other aspects of the model. The strength of information available is apparent in the quality of 
the fit to the sampling distribution. The posterior for the parameter /3 shows reasonable support around 
the true value, being slightly pulled towards 0 by the prior. 

We can further inspect how these come together in the posterior predictive conditional distribution 
for Y given x. Gonsider this distribution for x = 5 and x = 12, for both our model and the linear 
DDF mixture model, as shown in Figure 5. Again, our model provides a reasonable fit. The predictive 
distribution is not as accurate as the marginal distribution for Y, but this is to be expected, since the 
conditional distribution is a product of the whole model, compounding uncertainties from both /3 and 
the marginal distribution for y. Nonetheless, the fit is good and noticeably better than the flexible 
linear DDF mixture, as you would expect, given that the sampling distribution is within the support 
of our model. Goncretely, the Ll-distance between the estimated conditional and the true conditional 
distribution can be calculated in each case. When a; = 5 the distance to our prediction is 0.00869, 
whereas the distance to the linear DDF is 0.0214, and when a: = 12 the distance to our prediction is 
0.0127 and the distance to the linear DDF is 0.0146. 
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(a) X = 5 (b) X = 12 

Figure 5: Predictive densities for (a) a: = 5 and (b) x = 12. The true predictive is shown in black, the predictive 
distributions under our model in blue, and the predictive under the linear DDP mixture in green. 


A point of note is that these posterior predictive plots are smoothed kernel density estimates of MCMC 
samples. Therefore, Gaussian shapes are slightly exaggerated. Whilst not entirely clear from the plot, 
both our predictive and the sampling distribution comprise of slightly skewed Gaussian distributions, 
since the conditional distribution is the marginal distribution for Y weighted across the quantiles. 



Figure 6: Simulated data from a linear model with Gaussian residuals (black dots). 80% HDP intervals of the 
predictive distribution of our model at each value of x are represented in blue, and true HDP interval in black 

To illustrate that the model is capable of modeling a range of distributions, we consider data sampled 
from a Gaussian linear model. The covariates are simulated uniformly on [0,10], with Y ~ A^(3 + 2x, 2) 
and n = 300. We use our model, modeling the marginal for Y with a Polya tree prior whose partition is 
set on a Gaussian distribution with mean 12.5 and standard deviation 6, and = m^. A Gaussian 

prior centered at 0 with variance 8 is used for for /3. The posterior predictive 80% HPD intervals display 
a reasonable fit of the linear data, shown in Figure 6. The variance seems slightly inflated, but this is a 
consequence of the large support of the model. 
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5.2 US Census application 

We apply the methodology to a regression task using US census data^ for personal annual income. 

We use the American Community Survey data from 2013, which comprises of responses to questions 
on the survey given to a 1% sample of the US population. Since we are interested in income, the subset 
of 1,371,401 employed civilians over the age of 16 is used. We have used a relevant, linearly independent 
subset of the data as covariates, excluding highly informative questions such as occupation, which would 
almost completely explain the response. This leaves 15 explanatory variables, 10 of which are categorical 
variables, some of which have many levels. The result is a 1,371,401 x 114 design matrix. 

The covariates are: US state (Texas as a baseline), weight, age, class of worker (employee of private 
for-profit company as a baseline), travel time to work, means of transportation to work (works from home 
as a baseline), language other than English spoken at home (no as a baseline), marital status (married 
as a baseline), educational attainment (regular high school diploma as a baseline), gender (male as a 
baseline), hours worked a week, weeks worked last year, disability status (without a disability as a 
baseline), quarter of birth (first quarter as a baseline), and world area of birth (United States of America 
as a baseline). 



2e+05 3e+05 4e+05 

Annual income in Dollars 


40000 60000 80000 

Annual income in Dollars 


100000 120000 


Figure 7: Histograms of annual income on different scales. Right hand plot is zoomed in on incomes up to 
120 , 000 . 


The levels of annual income shown in Figure (7) can be seen to be heavy tailed, which requires a flexible 
model to capture. Another noticeable feature of the data is that the income levels are discontinuous, with 
large spikes in frequency at particular income levels. This could in part be due to standardized salary 
structures resulting in certain salary levels becoming common. This motivates the use of a nonparametric 
approach as it is difficult to imagine how a parametric density could conditionally capture the features 
shown in Figure (7). However, standard Bayesian nonparametric models simply cannot be applied to 
a problem of this scale. Attempting to apply existing methods in this literature, such as the linear 
dependent Dirichlet process mixture, failed to run due to the dimensionality and scale of the data. 

For the analysis, we consider both the empirical distribution function and a Polya tree prior for 
the marginal distribution of y. The partition of the Polya tree is set on the quantiles of a Gaussian 
distribution with mean 35,000 and standard deviation 20,000, and = w?. We use a log-linear 

regression function A(a:) = exp(/3a;) and place independent Gaussian priors with mean 0 and unit variance 
on the coefficients in /?. 


5.2.1 Predictive performance 

We compare the out-of-sample predictive performance of our model with three competing non-Bayesian 
approaches namely, a standard linear regression model, a median regression model and a LASSO^. For 

^http://www.census.gov/acs/www/data_documentation/pums_data/ 

^These models were fitted in R using the functions Im, rq (from the quantreg package) and lars (from the lars package). 
For LASSO the regularization parameter was chosen using cv.lars. Default settings were used for each. 


10 



our model we investigated three distinct priors for the marginal distribution of the response: a Polya 
tree centred on a Gaussian, a Polya tree centred on a Laplace, and an empirical Bayes approach using 
the empirical CDF. To compare methods we use repeated random subsets of 1000 test samples and train 
each model on the remaining data, with 10 repeats. Predictive accuracy is judged by mean squared- 
error (MSE), mean absolute error (MAE) and qualitatively via a qq-plot. To create the qq-plots we 
compute the predictive distribution function F{y\x) evaluated at the observed value for each of these 
test samples. Under the assumption that we have a perfect predictive distribution, these values should 
be independent uniform random variables. A deviation from this distribution implies a mis-match of the 
posterior predictive and the actual distribution. We are unable to apply this approach to the median 
regression model, as it does not provide a predictive distribution and would require htting the model 
for a large number of quantiles. In the case of the linear model we used maximum likelihood estimates 
for prediction, rather than a fully Bayesian approach. With such a large dataset the strength of any 
reasonable default prior would be significantly diminished, so this should mimic a Bayesian approach 
well. 

Table 1: Mean out-of-sample prediction errors with standard deviation of this error after 10 repetitions 



Mean square error (10®) 

Mean absolute error (10^) 

Empirical model* 

2.79 ±0.51 

2.44 ±0.15 

Polya tree (Gaussian)* 

2.81 ±0.64 

2.41 ±0.16 

Polya tree (Laplace)* 

2.71 ±0.52 

2.44 ±0.14 

Linear model 

2.66 ±0.59 

2.67 ±0.16 

LASSO 

2.99 ±0.65 

2.81 ±0.16 

Median regression 

2.99 ±0.68 

2.48 ±0.17 


Summary statistics of predictive fit are shown in Table 1. Perhaps unsurprisingly on such a large 
data set the linear model targeting the conditional mean does best on MSE but this is at the expense of 
the median under MAE. In addition, studying the predictive qq-plot in Figure(8b) shows the inadequacy 
of the linear model to provide calibrated predictions. The LASSO performs relatively poorly suggesting 
most covariates are influential for prediction, whereas the median regression whilst, as expected, provides 
relative accuracy on the MAE does so at the expense of MSE and as mentioned above suffers from 
the lack of a fully predictive model. The Bayesian nonparametric methods perform relatively well on 
both summary measures, with perhaps that based on the Laplace marginal showing greatest accuracy. 
In Figure(8a) we show the predictive qq-plot from this model, demonstrating that the full predictive 
distribution is captured well. 




(a) Proposed model with Polya tree prior (b) Linear model 

Figure 8: qq-plots (a) under our model using a Polya tree prior centred on Laplace for the marginal distribution 
of the response and (b) using a standard linear model. 


These diagnostics suggest that even non-linear regression models with parametric noise would not 
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provide a satisfactory fit for the data, since the unusual conditional distribution of the response cannot 
be captured by such models. This highlights the benefit of our nonparametric approach. 

We next consider inference for covariate effects. In order to gain a measure of the relevance of each 
covariate we quantified the concentration of the posterior probability measure away from the prior “null” 
centring of j3j = 0. To do this we estimated the Bayesian sign-probability from the posterior marginal 
for each covariate as, 


PrSigrij = max 



(19) 


where •n{j3j\-) is the posterior marginal for j3j. This measures the relative tail area in the posterior 
marginal laying to the left or right of 0. A large value of PrSign suggests there is strong evidence 
against jSj = 0. In certain respects this is akin to a Bayesian marginal version of a p-value, and is 
trivially calculated from MCMC output, or from normal approximations to the posterior distribution. 
Table 2 shows the most relevant covariates as ranked by this measure. 


Table 2: Top covariate parameters ranked by (19): the log posterior probability of the parameter being a different 
sign to the posterior mean. A negative parameter value has a positive effect on income. 

I Log probability of different sign Posterior mean 


Hours worked a week 

-1.1 

X 

10^ 

-0.044 

Weeks worked last year 

-1.0 

X 

10^ 

-0.045 

Bachelor’s degree 

-4.2 

X 

10^ 

-0.80 

Master’s degree 

-4.16 

1 X 

10^ 

-1.0 

Professional degree beyond a bachelor’s degree 

-2.7 

X 

10^ 

-1.4 

Age 

-2.6 

X 

10^ 

-0.018 

Female 

-1.8 

X 

10^ 

0.35 

Doctorate degree 

-1.5 

X 

10^ 

-1.2 

Never Married 

-1.3 

X 

10^ 

0.37 

Associate’s degree 

-6.5 

X 

10^ 

-0.39 

Travel time to work 

-3.2 

X 

10^ 

-0.0035 

1 or more years of college credit, no degree 

-2.0 

X 

10^ 

-0.18 

Self employed (incorporated) 

-2.0 

X 

10^ 

-0.30 

Grade 11 in school 

-1.8 

X 

10^ 

0.38 

Walks to work 

-1.5 

X 

10^ 

0.36 

Disabled 

-1.3 

X 

10^ 

0.19 


Unsurprisingly, hours worked a week and weeks worked last year show high certainly of a positive 
effect on income. After these, educational achievement measured via degrees unsurprisingly imply higher 
earnings compared to the regular high school diploma. Since these are part of the same variable it is 
simple to compare the effects due to these degrees. Despite Bachelor’s degree providing the most certainty 
of a positive effect, a further Professional degree beyond bachelor’s has the highest posterior mean effect. 
The ranking in Table 2 reflects the greater evidence in the data for a non-zero Bachelor effect, due to a 
much higher number of observations of those with Bachelor’s degrees, and hence lower variance in the 
effect size compared with those with a higher degree. There is also strong evidence for Female workers 
earning less than their male counterparts, as well as increasing income with age and even travel time to 
work. 

Finally, we show it is simple to provide the full posterior predictive distribution of annual income of 
somebody in the test sample, using the Polya tree model. We choose as a hypothetical person a 57 year 
old female from North Carolina, who is self employed, married, 140 lbs bodyweight, who works from 
home, speaks English at home, went to college but for less than a year, who usually works 30 hours a 
week, for 43.5 weeks last year, was born in the first quarter of the year in the USA. The structure and 
shape of the posterior predictive, represented in Figure 9, match that of the marginal distribution for Y 
in the data, just on a narrower range. 
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Figure 9: Posterior predictive distribution using a Polya tree for the marginal distribution 


6 Discussion 

We introduced a new Bayesian semiparametric regression model that is designed to scale to large data 
applications. In doing so we make use of an interpretable model for ranks, via a Plackett-Luce copula 
method, and nonparametric density models for the marginals. We used a composite marginal likelihood 
approximation that leads to a number of advantages. It affords computationally tractability, aids in the 
interpretation of the model, and makes prior specihcation explicit on known objects. 

The key to the scalability of the method is the use of the composite likelihood approximation, which 
splits the inference into two simpler tasks. The use of the Laplace approximation for the covariate effect 
and the Polya tree for the marginal response allow for fast posterior inference, without requiring any 
MCMC sampling methods. In fact, sampling methods are only used for prediction, which is by far the 
slowest part of the inference procedure. 

Going forward, it would be interesting to see if theoretical bounds on the approximation error as a 
function of sample size could be derived. It may also be possible to apply results such as those found 
in Kim (2006) to provide further guarantees of asymptotic behavior such as properties of the predictive 
distribution. In addition, it would be interesting to explore non-linear models for the regression function 
A(x), such as those based on a random forests methodology. In fact, random forests applied to the US 
Census dataset (with restricted node size to enable application to this scale) gives a highly competitive 
MSE to our tested models. This might be because random forests is able to capture interaction terms 
between covariates, which seem highly plausible a priori in this particular dataset. It will be interesting 
to incorporate such flexibility into a Bayesian nonparametric approach using Plackett-Luce regression 
functions. 
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A Details on simulating from the predictive 

In this section we provide additional on the simulation from the predictive distribution depending on the 
choice of the prior on Fy- 

A.l Empirical CDF 

It might be the case that n is so large that simply using the empirical CDF is a reasonable approximation. 
In this case, the inversion of the cdf becomes trivial, and MCMC is only required for the /3 posterior 
sample. 

• Simulate from the partial posterior 

• Z'--Exp(A^o-)(a:')) 

• Calculate := 1 — ^ 

• Set 

A.1.1 Bayesian Bootstrap 

An alternative approach might be to use a Bayesian Bootstrap on Fy. This works out similarly to using 
the empirical CDF, but we must simulate the Dirichlet weights for each of the atoms. The sampling 
scheme becomes: 

• Simulate from the partial posterior 

• Sample Z' ~ Exp(A^(j) (x')) 

• Calculate := 1 — ^ 

• Simulate (IFi, IF 2 , ■ ■ •, kF„) ~ Dirichlet(l, !,...,!) 

• Set lV,>u(j)}) 

A.1.2 Polya Trees 

Under our composite likelihood scheme, the posterior for Fy is also a Polya tree, due to conjugacy of 
the Polya tree prior. Simulation then proceeds as follows: 

• Simulate from the partial posterior 

• Z'--Exp(A^O)(x')) 

• Calculate ;= x _ i 

Then all we need to calculate is Polya trees make this easy. A sample from a Polya 

tree distribution is a random probability measure, which is constructed by assigning random masses to 
each branch in a partition tree of the space. So, given the first partition point in the tree, we can simply 
generate the random mass either side of this point, and trivially deduce which branch lies 

in. We repeat this process down the tree until we reach the truncation point often used when using 
Polya trees. 

Formally, given a Polya tree truncated at level M, let oq = 0, Bq = 1, cq = 0 and for k from 1 to M: 

• if w G [ofc, 9k{bk-i - ak-i) + Ofc-i] 

— Cfc = Cfc-iO 

— ttfe = Ofc-i 

— bk = Ofc-i + 9k{bk-i — Ofc-i) 
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• Otherwise 


— Efc = Cfc-ll 

— Uk = ttk-i + 9k{bj,_i — ttk-i) 

— bk = Ofc-i 


A.1.3 Dirichlet Process Mixture models 

The difficulty here becomes the inversion of Fy, since this has no closed form. A simple Monte Carlo 
approximation can be used to approximate this inversion for each posterior sample. 

• Simulate from the partial posterior 

• Sample Z'~ Exp(A^(j) (x')) 

(7) 

• Simulate Fy from the partial posterior 

• Simulate ~ Fy ^ for fc = 1,..., A 

• Calculate := 1 — ^ 

• Set r'o' = r, 
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